HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
MASTER THESIS
Question Answering in Vietnamese
NGUYEN THI MUNG
Mung.NT211261M@sis.hust.edu.vn
School of Information and Communication Technology
Supervisor:
Dr. Nguyen Thi Thu Trang
School:
Information and Communication Technology
Hanoi, 04/2023
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
MASTER THESIS
Question Answering in Vietnamese
NGUYEN THI MUNG
Mung.NT211261M@sis.hust.edu.vn
School of Information and Communication Technology
Supervisor:
Dr. Nguyen Thi Thu Trang
School:
Information and Communication Technology
Hanoi, 04/2023
Supervisor’s signature
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

CONFIRMATION OF MASTER THESIS REVISION

Full name of thesis author: Nguyen Thi Mung
Thesis topic: Automatic Question Answering in Vietnamese
Major: Data Science
Student ID: 20211261M

The author, the scientific supervisor, and the Thesis Examination Committee confirm that the author has revised and supplemented the thesis according to the minutes of the Committee meeting on 28/04/2023, with the following contents:

1. Clarifying the contributions of the thesis
- In the revised version, the author has presented the contributions of the thesis more clearly through the following contents:
- The author added Section 2 - Thesis's contributions to the conclusion chapter (page 46).
- The author clearly presented the necessity of a data-building process for the question answering problem that takes speech input into consideration (pages 17-18).
- The author added clarification of how the training data and the testing data are evaluated in the proposed process (pages 28-29, 31-32).

2. Clarifying the algorithms used
- The author described in more detail how the question answering problem is solved with the text classification approach, specifying how the class labels are created and how the answer is found once the class label of the question is determined (page 36).
- The author presented more clearly how the answer is found using the similarity comparison model and how the two approaches are evaluated on the same scale (pages 38-39).

3. Further explanation of the experiments and the algorithm selection
- The author added the hyperparameters of the models used in the text classification approach, including Random Forest, SVM, LSTM, and PhoBERT, and of the similarity comparison model SBERT (pages 39-41).
- The parameters of the deep learning models were kept the same as in the original models. For the machine learning models, tuning was incorporated into the training process to automatically find the most suitable parameter set.

4. Removing unnecessary parts of the presentation
- The author removed the presentation on question groups (page 18 of the unrevised thesis and page 19 of the revised thesis).
- The author removed the Naïve Bayes algorithm from the previous thesis (page 7 of both versions of the thesis, pages 39 and 41 of the unrevised thesis, and pages 43-44 of the revised thesis).

5. Revising the presentation of formulas
- The author revised the presentation of formulas according to the university's template (pages 8-15, 42-43).

6. Correcting spelling errors
- The author reviewed and corrected the spelling errors in the thesis.

7. Removing the chapter numbering of Chapter 1 (Introduction) and Chapter 5 (Conclusion)
- The author removed the chapter numbering of Chapters 1 and 5. Specifically, Chapter 1 was changed from "Chapter 1. Introduction" to "Introduction", and Chapter 5 was changed from "Chapter 5. Conclusion and future works" to "Conclusion and future works" (pages 1 and 46).

(Unless otherwise noted, the page numbers in this document correspond to the page numbers of the revised thesis.)

Date ........................, 2023
Supervisor
Thesis author
CHAIR OF THE EXAMINATION COMMITTEE
GRADUATION THESIS ASSIGNMENT
Name: Nguyen Thi Mung
Phone: +84.394.338.777
Email: Mung.NT211261M@sis.hust.edu.vn; mungyp98@gmail.com
Class: CH2021A
Affiliation: Hanoi University of Science and Technology
I, Nguyen Thi Mung, hereby warrant that the work and presentation in this thesis
were performed by myself under the supervision of Dr. Nguyen Thi Thu Trang. All
the results presented in this thesis are truthful and are not copied from any other
works. All references in this thesis, including images, tables, figures, and quotes,
are clearly and fully documented in the bibliography. I take full responsibility
for any copying that violates school regulations.
Student
Signature and Name
Nguyen Thi Mung
ACKNOWLEDGEMENT
First of all, I would like to express my sincerest and deepest gratitude to
Dr. Nguyen Thi Thu Trang. With both her enthusiasm and patience, she helped
me orient the topic, gave suggestions, instructed me in detail, and created the best
conditions for me to complete this thesis. She is like a warm but also strict mother,
and sometimes like a friend to whom I can easily confide and with whom I can share
my difficulties. Under her guidance, I feel that I have improved a lot.
I would like to express my sincere thanks to the leadership and teachers at
Hanoi University of Science and Technology in general, and at the School of
Information and Communication Technology in particular, for giving me the
opportunity to study in a new, useful, and memorable environment in my
student life.
I also want to thank my brothers and sisters, friends, and fellow students in laboratory
914, as well as my partners. Thank you to everyone for the detailed guidance, enthusiastic
help, and encouragement during my time in the laboratory as well as during my
thesis work. Along with that, I would like to thank my friends inside and outside
the School of Information and Communication Technology for their interest,
sharing, and help during the past time.
Finally, I would like to express my sincere thanks to my family. I thank my
family for always loving and caring for me, and for being a spiritual support and a
great source of motivation for me to overcome difficulties and challenges. I thank
Mr. Tuan, my love, for being there to encourage me in the most stressful times. In
the process of writing this graduation thesis, even though I have tried my best,
mistakes are still inevitable. I look forward to receiving suggestions from
teachers and friends so that I can avoid these errors in the future.
Once again, I sincerely thank you!
ABSTRACT
Finding information is becoming more and more challenging as the amount
of knowledge on the Internet grows bigger and bigger. Conventional search
engines only return lists of short paragraphs or related links, which makes it
difficult for users, especially those who lack experience and search skills.
It is therefore essential to build a Question Answering system capable of quickly
giving an accurate answer to a question. For this reason, the author proposes the topic
"Question Answering in Vietnamese", with the goal of building a question-
answering system applicable to Vietnamese, especially considering input
through the human voice. Previous studies have solved the problem with many
different approaches, among which the approach based on similar questions makes
the system easy to store and deploy. Data is an important factor in ensuring the
output quality of the system. The thesis proposes a process of building data based on
similar questions, consisting of two main steps of collecting data through two systems,
named the Written Collection System and the Speech Collection System, and applies this
process to building data for the Digital Transformation domain with the initial
question-answer pairs provided by the Ministry of Information and
Communications. Based on the built data, the thesis also evaluates question
answering models in two approaches: classification and comparing the similarity
between questions. The results show that the models achieve high accuracy, from 82%
to 94%, with the SVM model achieving the highest accuracy. At the same time, the
model size is not too large and the prediction time is fast, which is suitable for
deployment in practice. The evaluation results also show that the Automatic
Speech Recognition (ASR) module affects the quality of the models by 3.9 to 10%.
In the future, the thesis aims to expand the initial questions based on the available
documents and, at the same time, to partially automate and create tools to support the
data quality controller in evaluating the data for the model.
TABLE OF CONTENT
INTRODUCTION................................................................................................ 1
1. Problem Formulation ...................................................................................... 1
2. Goal and scope ............................................................................................... 5
3. Solution orientation ........................................................................................ 5
4. Outline............................................................................................................ 6
CHAPTER 1. THEORETICAL BACKGROUND ........................................... 7
1.1 Text classification algorithms .................................................................... 7
1.2 BERT language model ............................................................................. 12
1.3 Feature extraction..................................................................................... 15
CHAPTER 2. BUILDING A VIETNAMESE QA DATASET ...................... 17
2.1 Overview .................................................................................................. 17
2.2 The process of building a Vietnamese QA dataset .................................. 18
2.3 Data analysis ............................................................................................ 32
2.4 Data Disclosure ........................................................................................ 35
CHAPTER 3. VIETNAMESE QUESTION ANSWERING MODEL,
EXPERIMENT AND EVALUATION ............................................................. 36
3.1 Vietnamese Question Answering problem .............................................. 36
3.2 Experiment setup ...................................................................................... 39
3.3 Results and evaluations ............................................................................ 43
CONCLUSION AND FUTURE WORKS ....................................................... 46
1. Conclusion.................................................................................................... 46
2. Thesis’s contributions .................................................................................. 46
3. Future works................................................................................................. 47
REFERENCES ................................................................................................... 48
LIST OF TABLES
Table 1 Some examples of questions in the initial dataset ................................. 19
Table 2 Information about question length ......................................................... 20
Table 3 Some examples of short and long questions .......................................... 21
Table 4 Special questions .................................................................................... 24
Table 5 Data collection campaigns information ................................................. 32
Table 6 Information about data collected through campaigns ............................ 33
Table 7 Model hyperparameters ........................................................................... 40
Table 8 Confusion matrix .................................................................................... 42
Table 9 Evaluation results of experimented models ........................................... 43
Table 10 Size and average prediction time ......................................................... 44
LIST OF FIGURES
Figure 1 Approaches to the QA problem. ............................................................. 2
Figure 2 An example of a decision tree. ............................................................... 7
Figure 3 A biological neuron. ............................................................................. 10
Figure 4 An artificial neuron. .............................................................................. 10
Figure 5 The structure of the RNN network. ...................................................... 11
Figure 6 The architecture of LSTM. ................................................................... 12
Figure 7 BERT, OpenAI and ELMo. .................................................................. 13
Figure 8 Masked LM. .......................................................................................... 13
Figure 9 An example of the input in the BERT model. ...................................... 14
Figure 10 One-word CBOW model structure. .................................................... 16
Figure 11 Data building process.......................................................................... 26
Figure 12 Written data collection process. .......................................................... 27
Figure 13 The data collection interface. .............................................................. 28
Figure 14 Training data evaluation process. ....................................................... 28
Figure 15 Speech data collection process. .......................................................... 30
Figure 16 Speech data collection interface. ........................................................ 31
Figure 17 Distribution of the number of words in a sentence with the Written
Collection System. ............................................................................................... 34
Figure 18 Distribution of word count in the data collected by the speech system.
.............................................................................................................................. 34
Figure 19 Text classification model architecture. ............................................... 37
Figure 20 Similarity comparison model Sbert. ................................................... 38
LIST OF ACRONYMS
ASR    Automatic Speech Recognition
QA     Question Answering
FAQ    Frequently Asked Questions
NLP    Natural Language Processing
SVM    Support Vector Machine
LSTM   Long Short-Term Memory
INTRODUCTION
In this chapter, the thesis presents the reasons for choosing the topic, based on the
analysis of actual needs as well as previous studies on the question-answering
system in Vietnamese and in the world. Along with that, this chapter also gives the
aim, scope of the topic, research orientation, and layout of the thesis.
1. Problem Formulation
Finding information is becoming more and more challenging as the amount
of knowledge on the Internet is getting bigger and bigger. Conventional search
engines only return lists of short paragraphs or related links, which makes it
difficult for users, especially those who lack experience and search skills.
Therefore, it is essential to build a question-answering system capable of giving
an accurate answer to a question quickly. Question Answering (QA) is a large
branch in the field of natural language processing (NLP), which takes as input a
question in natural language, possibly in text or speech form, and returns the
corresponding answer [1].
Classification of the QA system
There are many ways to classify QA systems. Based on the data source, we
can divide the QA problem into three main categories: structured data, semi-
structured data, and unstructured data [2]. A knowledge graph is a representation
of structured data. Semi-structured data is usually presented in the form of lists or
tables. Unstructured data is often represented as natural language text such
as sentences, paragraphs, documents, etc. Based on the domain, question-
answering systems are divided into two main types: open-domain QA systems and
closed-domain QA systems [2]. The goal of an open-domain system is to answer
questions in many different fields, based on mining rich information
sources such as Wikipedia or Web search. Meanwhile, a closed-domain system is
geared towards answering questions for a particular domain. The number of
questions in a closed-domain system is smaller, resources are limited, and the
system is built with the participation of experienced experts in that field.
Approaches for solving QA problems
Previous studies have solved the QA problem in many different ways.
To the best of our knowledge, the approaches to this problem can be divided into
four main groups, as described in Figure 1 [3].
Figure 1 Approaches to the QA problem.
Figure 1 describes the approaches that previous studies have used to solve
the QA problem, including (i) the traditional approach, (ii) Information Retrieval
(IR) combined with Machine Reading Comprehension (MRC), (iii) using a
knowledge base (KB), and (iv) based on similar questions with Question
Entailment (QE) [3].
With the first approach, the QA problem is solved by a pipeline consisting of
three main components: Question Processing, Document Retrieval, and Answer
Extraction [1]. First, the user's question will be analyzed and processed by the
Question Processing component. The task of this component is to understand the
user's question and generate the query as input for the next component. At the same
time, this component also exploits the content of the question to be able to provide
useful information such as question type, entities, and important information,
helping to increase the accuracy of the answer extraction process [4]. An 8-step
pipeline in this module includes entity labeling, POS tagging, linguistic trimming
heuristics, dependency parsing, sentiment analysis, and generating patterns for
queries with ranking given by Usbeck, Ngomo, Bühmann, and Unger introduced
in their study [5]. After the question has been analyzed by the first component, the
Document Retrieval component will rely on that analysis to search for related
documents, usually texts or paragraphs based on an IR module or Web Search
3
Engines [6]. Finally, the Answer Processing component will search and return the
final answer based on those documents. To extract the answers, the research is
usually based on the extraction of real information available in the documents [7]
[8] [9], combined with previously analyzed answer-type information. [10] suggests
that some studies generate latent information from answers and questions, and then
use matching technologies such as surface text pattern matching [11] [12], word
or phrase matching [6], and syntactic structure matching [7] [13] [14]. Deploying
a QA system in this approach helps to control the system in a better way, but this
is a rather complicated task because it requires a combination of many natural
language processing and information retrieval technologies.
The development of deep learning technologies has allowed data processing
with a large amount of computation, which makes the research directions of QA
problems based on reading comprehension more widely studied [1]. MRC is the
problem of finding an answer to a question in natural language, based on a given
passage. This passage will be selected among many text fragments in the database
under the evaluation of the IR component, using document querying technologies.
To solve the MRC problem, based on the answers, there are two main research
directions: (i) generating the answer (Generative MRC) and (ii) extracting the
answer from the passage (Extractive MRC). In the first direction, the answer is
generated automatically based on the input information. This is also how humans
read and understand the content of a passage and give their answer, so the answer
is more natural and human-like. However, this also makes the construction
of the training data more difficult and the quality assessment of the model more
complicated. Notable datasets built for this direction include the English
NarrativeQA [15] and Natural Questions [16] datasets, and the Chinese DuReader
[17]. LSTM [18], ELMo [19], and GPT-2 [20] are popular models that researchers
use to solve MRC problems in this direction. Unlike the first direction, in the
second direction the answer is part of the input passage. This makes evaluation
easier and data construction less expensive. With the appearance of large datasets
such as CNN/Daily Mail [21], MS MARCO [22], RACE [23], and SQuAD 2.0
[24], the studies following this approach achieved very good results. Notably, the
BiDAF model [25] solves the problem by representing the text at different
levels, and QANet [26] combines CNN and self-attention. Pre-trained language
models such as BERT [27], XLM-R [28], and T5 [29] allow encoding the question
and the corresponding passage and returning the start and end positions of the
answer, and also achieved high results on the above datasets.
[30] and UIT-ViNewsQA [31] for problems following this research direction. The
research direction of MRC models has shown the ability of computers to exploit
information from documents to give answers. However, building data in this
direction requires a lot of work and effort. The storage and management of
documents are also very expensive.
Research following the KB approach uses structured data, in the form of a
knowledge graph or SQL databases, representing facts and relationships between
entities [32]. Berant et al. built the WEBQUESTION dataset, using the Google
Suggest API to generate actual questions that start with a wh-word and have a
unique entity [33]. However, the questions in WEBQUESTION were still quite
simple, so Berant and Talmor later improved this dataset with
COMPLEXWEBQUESTIONS in [34]. Compared with WEBQUESTION, the
questions in COMPLEXWEBQUESTIONS can have more entities and contain
more types of questions, such as compound, association, and comparison
questions. In research on building query systems answering questions related to a
particular product, Frank et al. [35] and Li et al. [36] use product attributes to form
a knowledge graph for the system. For Vietnamese, Phan and Nguyen [37] build a
knowledge graph in terms of triples (subject, verb, object) and transform the input
question into the corresponding form. Dat et al. [38] used intermediate
representative elements, including information about question structure, question
type, keywords, and semantic relationships between keywords, to build the
knowledge base of their system. Studies in this approach follow two main
methods: (i) IR, searching for answers by ranking possible candidates [39], and
(ii) semantic analysis to convert natural language queries into queries usable in
search engines [40]. However, this approach requires building a rather complex
knowledge system and a great deal of effort in maintaining and expanding it.
Finally, studies following the QE approach use question-answer pairs as the
source of knowledge for the system. This method builds on the definition of similar
questions, which can be answered in the same way. Given a user input question, the
task of the QA problem is to find a question similar to it and return the
corresponding answer [3]. [41] built question-answer pairs from the
Frequently Asked Questions (FAQs) of the Microsoft Download Center, combining
AND/OR search techniques and combinatorial search techniques to create a full
list of results for a related search. The CMU OAQA [42] uses a bidirectional
recurrent neural network (BiRNN) combined with an attention mechanism to
predict the similarity between two questions. For Vietnamese, T. Minh Triet and
colleagues published the dataset UIT-ViCoV19QA [43], including 4,500
question-answer pairs related to the Covid-19 epidemic, collected from FAQs on the
official websites of healthcare organizations in Vietnam and around the world.
With this approach, QA systems can easily build and store data, especially in
closed systems serving the needs of a particular organization. However, data
building in this approach often requires the involvement of experts in the field.
In addition, the input of a QA system can be text or speech. With speech
input, there are two main approaches for the system to understand the user's
question: (i) training the end-to-end model and (ii) modularizing each component
separately. With the first approach, the system is a single unit, which takes audio
as input and returns the corresponding response. [44] proposes SpeechBERT, a
joint audio-and-text pre-trained model trained on both audio and text data. However,
building high-quality training data for such systems is difficult and expensive. In
the second approach, the system is separated into independent modules. First, the
ASR module is responsible for converting the speech to text; the generated text is
then passed to the next processing module to give the answer. With this approach,
the system has a modular division, so it is easy to control the quality of each
module. However, building such a pipeline system is also more complicated.
Based on the analysis of the approaches to the QA problem, and, at the same
time, on the need for building a Virtual Assistant for the Digital Transformation
domain of the Ministry of Information and Communications, the thesis aims to
research and evaluate QA models based on similar questions, taking into account
speech input to the system.
2. Goal and scope
Based on the studies in section 1, the thesis has two main objectives:
(i) building and enriching datasets in the Digital Transformation domain and (ii)
evaluating QA models based on the built data.
With the first goal, the thesis aims to build a process of collecting data for
the QA problem, with the initial data being question-answer pairs in the Digital
Transformation domain. The data after going through this process will be used as
training data and testing data for the corresponding QA model, with the user input
in speech. Although tested and implemented in the Digital Transformation data
domain, the proposed process should be general, extensible, and applicable to other
data domains. This will be a sample process for individuals, organizations, and
subsequent researchers to apply in designing and building their own QA system.
With the data that has been built, the second goal of the thesis is to evaluate
QA models on this dataset. The results of the test will contribute to proving the
quality of the built data and also serve as a basis for evaluating the feasibility of
deploying QA models in practice.
3. Solution orientation
From the objectives given in section 2, our proposed oriented solutions for
the thesis are as follows: (i) building a process of collecting training and testing
data for the QA model and (ii) evaluating QA models based on built data.
The data will be collected in written and spoken form, with the participation
of collaborators guided by experts in the field of Digital Transformation. In
particular, the Written Collection System will support the collection of training
data, and the Speech Collection System will support the building of test data
for the model.
Besides building data, the thesis will also test different models to solve the
QA problem in two main approaches: text classification and similarity comparison.
These are the methods that fit the built data. Based on the experimental results, the
thesis will evaluate the feasibility of these methods when deployed in practice.
4. Outline
The rest of the thesis is presented as follows.
Chapter 1 presents an overview of the theoretical background related to the
QA problem, focusing on models of text classification and assessing the similarity
of questions. This is the basis for evaluating the solutions in the next chapters.
Next, Chapter 2 presents studies on data construction based on available
question-answer pairs. Based on those studies, the thesis presents a proposal on the
data-building process to solve the QA problem, specifically applied in the Digital
Transformation domain.
Based on the built dataset, Chapter 3 presents the QA models and the evaluation
results obtained. This will be the basis for considering the feasibility of
implementing the QA models in practice.
Finally, the last chapter will present the conclusions reached by the thesis and
discuss future development directions.
The following are the details of each part of the thesis.
CHAPTER 1. THEORETICAL BACKGROUND
In this chapter, the thesis presents the theoretical background used in the
thesis, from which readers can grasp the basic concepts. The theoretical
background presented includes basic classification algorithms, feature extraction,
and the BERT language model.
1.1 Text classification algorithms
Random Forest
Random Forest [46] is a set of many Decision Trees [47]. The number of
trees in the forest can be up to hundreds or thousands of trees. A Decision Tree is
a structured hierarchical tree made up of sequences of rules. In the case of a boy
who wants to play soccer, Figure 2 is an example of a decision tree. Describe a
tree that decides to go soccer or stay at home based on weather, humidity, and
wind.
Figure 2 An example of a decision tree.
Based on Figure 2, we can see that when it is sunny and the humidity is
normal, the boy is more likely to go play soccer, and if it rains with strong winds,
he will likely choose to stay at home.
The key point in building a Decision Tree lies in the Iterative Dichotomiser
3 (ID3) algorithm [45]. Usually, data has many different attributes. In the
example above, the data includes information about the weather (sunny/rainy),
humidity (high/normal), and wind strength (strong/light). With these
attributes, ID3 determines their order at each decision step. The best attribute is
selected through a measure such as information gain. A split is considered good if
the data at that step belongs entirely to one class. On the contrary, if the data is
still mixed together in a large proportion, the split is not really good.
The forest generates decision trees randomly, each trained on random subsets of
the data, depending on the judgments made in the learning process. The final
decision is aggregated from the judgments of all the decision trees present in the
forest.
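To make the attribute-selection measure concrete, the following minimal Python sketch computes the entropy and information gain that ID3 uses to rank attributes; the toy weather records are invented for illustration and are not the thesis's data.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(records, attribute, target="play"):
    # entropy reduction obtained by splitting the records on one attribute
    base = entropy([r[target] for r in records])
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return base - remainder

# toy records for the soccer example in Figure 2 (invented values)
data = [
    {"weather": "sunny", "humidity": "high", "play": "no"},
    {"weather": "sunny", "humidity": "normal", "play": "yes"},
    {"weather": "rainy", "humidity": "high", "play": "no"},
    {"weather": "rainy", "humidity": "normal", "play": "no"},
]
print(information_gain(data, "weather"))   # ID3 picks the attribute with the largest gain
print(information_gain(data, "humidity"))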
Support Vector Machine
Support Vector Machine (SVM) [48] is a linear classification model used to
divide data into distinct classes. Consider the binary classification problem with
data points:

{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}

where x_i is an input vector represented in the space R^d and y_i ∈ {-1, +1} is the
class label corresponding to the input vector, in which y_i = +1 means the data
point belongs to the positive class and y_i = -1 means it belongs to the negative
class.

The goal of SVM is to define a linear function classifying between the two
classes:

f(x) = w^T x + b   (Eq. 1.1)

where w is the weight vector of the attributes and b is a real value. Based on the
function f(x), we determine the output value of the model as follows:

y = +1 if f(x) >= 0, and y = -1 if f(x) < 0   (Eq. 1.2)

Suppose (x+, +1) is the point in the positive class and (x-, -1) is the point in the
negative class closest to the separating hyperplane H_0: w^T x + b = 0. Let H_1 and
H_2 be two parallel hyperplanes, where H_1 passes through (x+, +1) and is parallel
to H_0, and H_2 passes through (x-, -1) and is parallel to H_0. The margin is the
distance between the two hyperplanes H_1 and H_2. In order to minimize the error
of the classification process, we need to choose the hyperplane with the largest
margin; such a hyperplane is called the maximum margin hyperplane.

The distance from H_1 to H_0 is:

d_1 = |w^T x+ + b| / ||w|| = 1 / ||w||   (Eq. 1.3)

The distance from H_2 to H_0 is:

d_2 = |w^T x- + b| / ||w|| = 1 / ||w||   (Eq. 1.4)

The margin between the two hyperplanes H_1 and H_2 is determined by
Equation 1.5:

m = d_1 + d_2 = 2 / ||w||   (Eq. 1.5)

Therefore, the problem of determining the maximum margin between the two
hyperplanes is reduced to determining w and b so that 2/||w|| reaches its
maximum value while satisfying the condition (since x+ and x- are the closest
points to the separating hyperplane and belong to H_1, H_2):

max_{w, b} 2 / ||w||  subject to  w^T x_i + b >= +1 if y_i = +1, and
w^T x_i + b <= -1 if y_i = -1   (Eq. 1.6)

Equivalent to:

min_{w, b} ||w||^2 / 2  with the condition  y_i (w^T x_i + b) >= 1 for all i   (Eq. 1.7)

The above algorithm applies to linearly separable data. For non-linear data,
SVM uses kernel functions to transform the data into a new space, in which the
resulting data is linearly separable. Some common kernel functions are linear,
polynomial, Radial Basis Function (RBF), and sigmoid [45].

For the multiclass classification problem with SVM, there are several ways to
reduce it to binary classification problems. Among them, the most commonly
used method is one-vs-rest (also known as one-vs-all or one-against-all) [45].
Specifically, if there are K classes, we build K models, in which each model
corresponds to one class. Each of these models distinguishes whether a data point
belongs to that class or not, or calculates the probability that the point falls
into that class. The final result is the class with the highest score or probability.
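As an illustration of the one-vs-rest strategy described above, the following sketch trains one binary LinearSVC per class with scikit-learn; the toy questions and group labels are invented for the example and do not come from the thesis's dataset.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# invented toy data: each question group becomes one class label
questions = ["chuyển đổi số là gì", "học máy là gì",
             "kinh tế số là gì", "vai trò của kinh tế số"]
labels = ["general", "concepts", "economy", "economy"]

# one binary LinearSVC is trained per class; prediction picks the class
# whose decision function gives the highest score
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(questions, labels)
print(model.predict(["kinh tế số giúp gì"]))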
Neural network
Neural networks are made up of single neurons, called perceptrons, which
are inspired by biological human neurons. Figure 3 depicts the structure of a
biological neuron.
Figure 3 A biological neuron.
Each biological neuron consists of three main components: (i) the cell body,
the bulge of the neuron, which contains the cell nucleus, provides nutrition to the
neuron, and can generate nerve impulses and receive nerve impulses transmitted
to the neuron; (ii) dendrites, short fibers that develop from the cell body, which
receive nerve impulses from other neurons and transmit them to the cell body;
and (iii) the axon, a long single nerve fiber responsible for transmitting signals
from the cell body to other neurons. Inspired by this, the artificial neuron is
designed with the structure depicted in Figure 4.
Figure 4 An artificial neuron.
Each neuron has inputs corresponding to dendrites, a processor using an
activation function corresponding to the cell body, and outputs corresponding to
axons. The activation function is usually a nonlinear function such as the sigmoid,
tanh, ReLU, or sign function [49].
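The following minimal sketch implements the forward pass of one such artificial neuron with a sigmoid activation; the input and weight values are invented for illustration.

import numpy as np

def neuron(x, w, b):
    # weighted sum of the inputs (dendrites -> cell body), then activation (axon output)
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid activation

x = np.array([0.5, -1.0, 2.0])  # inputs
w = np.array([0.4, 0.3, -0.2])  # learned weights
print(neuron(x, w, b=0.1))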
Many neurons combined together form a neural network. Neural networks
usually have three types of layers: the input layer receives input data from the
dataset, the output layer shows the predicted value of the model for the input
data, and the hidden layers are the layers between the input and output layers.
Recurrent Neural Network
With conventional neural networks, all inputs are independent of each other,
so there is no sequential link between them. In text processing problems, the
order of words in a document is very important. Based on this, the Recurrent
Neural Network (RNN) determines the value of the next element based on the
previous calculations. Figure 5 depicts the structure of the RNN network.
Figure 5 The structure of the RNN network.
The computation inside the network at each step is based on the input at that
step and the hidden state from the previous step, through an activation function
such as tanh or ReLU; the output at each step usually uses the softmax function.
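A single recurrent step can be sketched as follows; the dimensions and random weights are illustrative assumptions.

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # the new hidden state depends on the current input and the previous hidden state
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
x_t, h_prev = rng.normal(size=4), np.zeros(8)
h_t = rnn_step(x_t, h_prev,
               W_xh=rng.normal(size=(8, 4)),
               W_hh=rng.normal(size=(8, 8)),
               b_h=np.zeros(8))
print(h_t.shape)  # (8,)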
Long-Short Term Memory
Theory has shown that, for distant steps, the RNN suffers from the long-term
dependency problem, i.e. the network can only remember a small interval. This
happens because of the vanishing gradient: the gradient values get smaller as they
propagate down to the lower layers, so the updates performed by Gradient Descent
barely change the weights of those layers, making them unable to converge, and
the RNN does not get good results. The Long Short-Term Memory network
(LSTM) [49] was born to overcome this limitation. LSTM has a sequence
architecture similar to RNNs, but instead of having only one neural network
layer, each cell has up to four layers that interact with each other in a very special
way.
Figure 6 The architecture of LSTM.
A special feature of LSTM is the cell state, the line that runs across the top
of the diagram. This is considered the network's memory. At each cell, the LSTM
can add or remove the necessary information through three gates: the forget gate,
the input gate, and the output gate, as shown in the figure. The forget gate decides
what information is unnecessary and should be discarded in this state. The input
gate indicates what information should be added to the cell state. The output gate
decides the output of this cell. With such a structure, the LSTM network has the
ability to remember more distant states, thereby achieving better results than the
RNN network.
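A minimal LSTM text classifier in this spirit can be sketched with Keras as follows; the vocabulary size, sequence length, and layer sizes are illustrative assumptions, not the hyperparameters used in the thesis.

import tensorflow as tf

vocab_size, seq_len, embed_dim, num_classes = 10000, 40, 128, 9  # assumed values

model = tf.keras.Sequential([
    tf.keras.Input(shape=(seq_len,)),                  # padded sequences of token ids
    tf.keras.layers.Embedding(vocab_size, embed_dim),  # token ids -> dense vectors
    tf.keras.layers.LSTM(64),                          # final hidden state summarizes the sentence
    tf.keras.layers.Dense(num_classes, activation="softmax"),  # one probability per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()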
1.2 BERT language model
BERT [50] stands for Bidirectional Encoder Representations from
Transformers. It is a pre-trained model that learns bidirectional contextual
representation vectors of words, which are then transferred to other problems in
the field of natural language processing. Compared with word embedding models,
BERT is a breakthrough in representing a word as a numeric vector based on the
word's context.

The architecture of the BERT model is a multilayer architecture consisting
of several bidirectional Transformer encoder layers. The BERT Transformer uses
a bidirectional attention mechanism, while the GPT Transformer uses one-way
attention (less consistent with the way language is produced), where all words
attend only to the left context. The bidirectional Transformer is often referred to
as a Transformer encoder, while versions of the Transformer using only the
left-hand context are often referred to as a Transformer decoder, because they can
be used to generate text. The comparison between BERT, OpenAI GPT, and
ELMo is shown visually in Figure 7.
Figure 7 BERT, OpenAI and ELMo.
Two tasks are used to pre-train BERT: Masked LM and Next Sentence
Prediction [27].
Masked LM
To train a representation model based on bidirectional context, a simple
approach is to mask some random input tokens and then predict only the masked
tokens; this task is called "masked LM" (MLM). In this case, the hidden vectors in
the last layer corresponding to the masked tokens are put into a softmax layer over
the entire vocabulary for prediction. Google researchers masked 15% of all
WordPiece tokens in a sentence at random and predicted only the masked words.
Figure 8 shows the BERT training scheme under the masked LM task.

Figure 8 Masked LM.

Although this allows us to obtain a bidirectional training model, two
disadvantages exist. The first is that we create a mismatch between pre-training
and fine-tuning, because the [MASK] tokens are never seen during model
fine-tuning. To mitigate this, we do not always replace the selected words with the
[MASK] token. Instead, the training data generator chooses 15% of tokens at
random and performs the following steps. For example, with the sentence "con_chó
của tôi đẹp quá" (my dog is so beautiful), suppose the word chosen to mask is "đẹp"
(beautiful): 80% of the time the selected word in the training data is replaced with
the token [MASK], giving "con_chó của tôi [MASK] quá" (my dog is so [MASK]);
10% of the time the selected word is replaced by a random word, for example
"con_chó của tôi máy_tính quá" (my dog is so computer); and the remaining 10%
of the time the sentence is kept unchanged as "con_chó của tôi đẹp quá" (my dog
is so beautiful).

The Transformer encoder has no idea which word it will be asked to predict or
which word has been replaced by a random word, so it is forced to keep a
contextual representation of every input token. Also, replacing 1.5% of all tokens
with a random word does not seem to harm the model's ability to understand the
language. The second disadvantage of using MLM is that only 15% of tokens are
predicted in each batch, which suggests that more pre-training steps may be needed
for the model to converge.
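The 80/10/10 masking scheme described above can be sketched as follows; this is a simplified illustration of the procedure, not BERT's actual preprocessing code.

import random

def mask_tokens(tokens, mask_rate=0.15, vocab=("con_chó", "của", "tôi", "đẹp", "quá")):
    # of the 15% selected tokens: 80% become [MASK], 10% a random word, 10% unchanged
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok  # the model must predict the original token here
            r = random.random()
            if r < 0.8:
                out[i] = "[MASK]"
            elif r < 0.9:
                out[i] = random.choice(vocab)
            # else: keep the token unchanged
    return out, targets

print(mask_tokens(["con_chó", "của", "tôi", "đẹp", "quá"]))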
Next Sentence Prediction
Many important tasks in natural language processing, such as Question
Answering, require understanding the relationship between two text sentences,
which is not directly captured by language modeling. To train the model to
understand the relationship between sentences, we build a model that predicts the
next sentence based on the current sentence; the training data can be any corpus.
Specifically, when choosing sentence A and sentence B for each training sample,
there is a 50% chance that sentence B is the actual next sentence after sentence A,
and the remaining 50% of the time it is a random sentence from the corpus.

In order for the model to distinguish between the two sentences, we mark the
beginning of the first sentence with the token [CLS] and the end of each sentence
with the token [SEP]. Figure 9 shows an example of the input of the BERT model.

Figure 9 An example of the input in the BERT model.
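This input format can be observed with the Hugging Face tokenizer, as sketched below; the bert-base-multilingual-cased checkpoint and the sentence pair are illustrative choices, not necessarily what the thesis uses.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoded = tokenizer("chuyển đổi số là gì", "đó là bước phát triển tiếp theo")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', ..., '[SEP]', ..., '[SEP]'] -- [CLS] opens the pair, [SEP] closes each sentence
print(encoded["token_type_ids"])  # 0 for sentence A tokens, 1 for sentence B tokens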
1.3 Feature extraction
Feature extraction is the selection of text attributes and their vectorization into
a vector space that can be easily processed by computers. In the following, we
present some popular feature extraction methods.
Term Frequency Inverse Document Frequency (TF-IDF)
The TF-IDF value [51] of a word represents the importance of the word in a
document. TF (Term Frequency) is the frequency of occurrence of a word in a
document and is calculated according to Equation 1.8:

tf(t, d) = f(t, d) / Σ_{t' ∈ d} f(t', d)   (Eq. 1.8)

where f(t, d) is the number of occurrences of the word t in the document d. The
denominator in the above formula is the total number of words in the document.

IDF (Inverse Document Frequency) is the inverse frequency of a word in a
corpus. The purpose of IDF is to reduce the weight of words that often appear in
the text but do not carry much meaning. The formula for calculating IDF is as
follows:

idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )   (Eq. 1.9)

in which |D| is the total number of documents in the set D, and the denominator
is the number of documents in D containing the word t. The TF-IDF value is
calculated as follows:

tfidf(t, d, D) = tf(t, d) × idf(t, D)   (Eq. 1.10)

Words with high TF-IDF values are those that appear frequently in one document
and rarely in others. This value helps us filter out common words and retain
high-value words (the keywords of the document). TF-IDF is a simple way to
vectorize textual data, but the size of the vector is equal to the number of words in
the dictionary, increasing the computational load. Furthermore, word
representation using TF-IDF cannot represent words that are outside the
dictionary and cannot show relationships between words.
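In practice, TF-IDF vectorization is usually done with a library; the following scikit-learn sketch uses an invented toy corpus, and note that scikit-learn applies smoothing, so its values differ slightly from Eq. 1.9.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "chuyển đổi số là gì",
    "vai trò của kinh tế số là gì",
    "học máy là gì",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)       # one TF-IDF row per document
print(vectorizer.get_feature_names_out())  # the learned dictionary
print(X.shape)                             # (3, vocabulary size)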
Word2Vec
Word2Vec is a method of mapping words into a vector space whose
dimensions are smaller than that of the dictionary while preserving the semantic
relationship of the words. It can be constructed using two methods: Skip Gram and
Continuous Bag of Words (CBOW) [52].
The CBOW model uses the context around each word as input and tries to
predict the word that corresponds to that context. The architecture of the model is
shown in Figure 10.
Figure 10 One-word CBOW model structure.
The input, i.e. the context word, is a vector x = [x_1, x_2, ..., x_V] encoded as a
one-hot vector of size V, which is the size of the dictionary, where x_k = 1 with k
being the position of the word in the dictionary, and x_i = 0 otherwise. The hidden
layer consists of N neurons, and the output is also a one-hot vector
y = [y_1, y_2, ..., y_V] of size V. W (of size V × N) is the weight matrix between
the input and the hidden layer, and W' (of size N × V) is the weight matrix
between the hidden layer and the output layer. The neurons of the hidden layer
simply copy the weighted sum of the inputs to the next layer; there are no
activation functions like sigmoid, tanh, or ReLU. The only nonlinearity is the
softmax calculation in the output layer. While predicting the target word, we learn
the vector representation of the target word. Skip Gram works in the opposite
direction to the CBOW model: the Skip Gram model takes a word as input and
tries to predict the context around it.

Compared with TF-IDF, Word2Vec models have demonstrated semantic
relationships between words. The vector offsets between analogous pairs of words
are approximately the same, for example between "king" - "queen" and
"man" - "woman". However, Word2Vec has not yet solved the problem of
representing words that are not in the dictionary.
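A small CBOW model can be trained with Gensim as sketched below; the toy corpus and the parameter values are illustrative assumptions.

from gensim.models import Word2Vec

sentences = [
    ["chuyển_đổi_số", "là", "gì"],
    ["kinh_tế_số", "giúp", "tăng", "năng_suất"],
    ["học_máy", "là", "gì"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)  # sg=0 selects CBOW
print(model.wv["học_máy"].shape)         # (50,) -- the learned word vector
print(model.wv.most_similar("học_máy"))  # nearest words by cosine similarity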
Thus, Chapter 1 has presented the basic theoretical background for the
following chapters. Next, in Chapter 2, the author presents the process of
building data for the QA problem.
CHAPTER 2. BUILDING A VIETNAMESE QA DATASET
In this chapter, the thesis will present how to build data for the Vietnamese
Question Answering problem. The construction process uses two data collection
systems: the Written Collection System and the Speech Collection System.
2.1 Overview
With the proliferation of documents, the need to find information is
increasing day by day. A simple way to meet this need is to develop frequently
asked questions (FAQs) related to the organization, field, or issue that the
organization wants to convey. For example, the Microsoft Download Center
provides FAQs so that users can look up problems related to Microsoft products
and services [53]. The Ministry of Information and Communications also
released the "Cẩm nang chuyển đổi số" (Digital Transformation Handbook)
[54], which includes questions and answers on the field of Digital Transformation,
based on the speeches of the Minister of Information and Communications Nguyen
Manh Hung.
However, as the number of questions gets bigger and bigger, finding the
questions relevant to the problem you are interested in takes a lot of time and
effort. Search engines only return results matching the words contained in the
original question, but in daily communication, we humans have many different
ways to express a certain request. For example, in the field of smart homes, to ask
how to turn on the fan, we can directly ask "Làm sao để bật quạt" (How do I turn
on the fan) or say "Tôi muốn cho quạt chạy thì làm thế nào" (I want the fan to run,
how do I do it). In the latter example, the verb "bật" (turn on) has been replaced
with "chạy" (run). If only word matching is used, the system may confuse it with
another question when information is missing. Although expressed in two different
ways, both questions can have the same answer, instructing how to use a fan in a
smart home. The two questions in this case can be considered "similar" to each
other. Therefore, approaching the QA system in the direction of developing
similar questions for FAQs is a more flexible approach, helping the system
understand input questions more flexibly.
For the QA system to work effectively in this approach, building a high-
quality similarity dataset is essential. However, organizations do not always have
data available that is large enough for training the QA model. Several methods
have been proposed to enhance the data collection capabilities of QA systems.
One method is to use machine translation to automatically generate questions
that are similar to the questions in the system's database. However, this method
still has many limitations, because current machine translation methods have not
yet reached high accuracy, so the generated data does not really meet the
requirements of a quality QA system. Another method is to collect data from
various sources, such as discussion sites, online teaching sites, question-and-
answer forums, and online conversations between humans. However, collecting
and processing data from these sources requires special skills and tools, and it is
also necessary to ensure the reliability of the collected data. Therefore, to achieve
the best efficiency in building QA systems, organizations need a clear and
reasonable data collection process, combining different methods to collect
quality and reliable data for training the QA model.
The data collected through the organization's process should be divided
into two categories: training data and testing data. Training data needs to cover
enough cases and the basic questions that the organization receives. At the same
time, testing data also needs to be created to evaluate the capabilities of the QA
model and ensure that it meets the requirements and can work correctly in
different scenarios. In particular, with user input via speech, the QA model needs
to be evaluated under the influence of the ASR module. This is especially
important for QA systems deployed on mobile devices, where the use of speech
is common. However, the ASR process can face many problems, such as unclear
utterances, sound disturbance, noise, and users' different pronunciations,
affecting the ability of the QA model to give the correct answer. Therefore,
evaluating the QA model under the influence of the ASR module is an important
factor in evaluating the operability of the QA model with speech input.
When building data for an AI model, the data is usually split in a certain ratio,
for example 80:20 or 70:30, to produce training data and test data. The training
data is used to train the QA model to understand the user's input question and give
the corresponding answer. The test data is used to evaluate the quality of the QA
model. If the input of the model is speech and the data is built from speech using
the splitting method above, the cost of data construction will be extremely
expensive, because speech data takes a lot of work and time to collect and
evaluate, and it is not easy to fine-tune it to create a high-quality dataset. On the
other hand, current QA systems mostly follow the direction of modularization, in
which the system uses an ASR module to convert the speech received from the
user into text, then analyzes this text to understand the user's question. Therefore,
we can build only text data to train the model and evaluate the trained model on
speech data to see the true quality of the model under the influence of the ASR
module. This approach minimizes the cost of data building, because text can be
collected faster and at a lower cost than speech. Based on this, the thesis proposes
a data-building process consisting of two main steps: step 1 builds the training
data, focusing on text data, and step 2 builds the test data, with the data received
as speech. The conventional random split is sketched below for reference.
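The conventional split mentioned above can be sketched with scikit-learn as follows; the toy question-label pairs are invented, and in the proposed process this single random split is replaced by the two collection steps.

from sklearn.model_selection import train_test_split

pairs = [("chuyển đổi số là gì", 0), ("học máy là gì", 1)] * 50  # invented data
questions, labels = zip(*pairs)
X_train, X_test, y_train, y_test = train_test_split(
    questions, labels, test_size=0.2, stratify=labels, random_state=42)
print(len(X_train), len(X_test))  # 80 20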
2.2 The process of building a Vietnamese QA dataset
Initial dataset
The initial dataset, provided by the Ministry of Information and
Communications, consists of corresponding question-answer pairs belonging to
the field of Digital Transformation. Each question-answer pair is assigned to a
group for management purposes. The dataset includes 194 question-answer
pairs, divided into 9 groups. These questions were built on the basis of the speeches
of the Minister of Information and Communications Nguyen Manh Hung, related
to basic concepts, planning mechanisms, policies, and strategies, including issues
related to digital transformation in Vietnam and around the world.

Questions in different groups are clearly differentiated; however, some
questions within the same group can be easily confused with one another. Some
examples of question-answer pairs are given in Table 1.
Table 1 Some examples of questions in the initial dataset

Num. 1
Group: Câu hỏi chung về chuyển đổi số (General questions about digital transformation)
Question: Chuyển đổi số là gì? (What is digital transformation?)
Answer: Chuyển đổi số là bước phát triển tiếp theo của tin học hóa, có được nhờ sự tiến bộ vượt bậc của những công nghệ mới mang tính đột phá, gọi chung là công nghệ số. (Digital transformation is the next development of computerization, made possible by the remarkable progress of breakthrough new technologies, collectively known as digital technology.)

Num. 2
Group: Câu hỏi chung về chuyển đổi số (General questions about digital transformation)
Question: Có thể nói rõ hơn về chuyển đổi số được không? (Can you be more specific about digital transformation?)
Answer: Bạn có thể tham khảo cuốn Cẩm nang Chuyển đổi số do Bộ Thông tin và Truyền thông xuất bản, hoặc truy cập website dx.mic.gov.vn để tìm hiểu thêm thông tin chi tiết. (You can refer to "Cẩm nang chuyển đổi số" (the Digital Transformation Handbook) published by the Ministry of Information and Communications, or visit the website dx.mic.gov.vn for more detailed information.)

Num. 3
Group: Làm rõ một số khái niệm liên quan đến chuyển đổi số (Clarifying some concepts related to digital transformation)
Question: Học máy là gì? (What is machine learning?)
Answer: Học máy là một nhánh nghiên cứu của trí tuệ nhân tạo và khoa học máy tính tập trung vào sử dụng dữ liệu và các thuật toán để bắt chước cách con người học. (Machine Learning is a branch of research in Artificial Intelligence and Computer Science that focuses on using data and algorithms to mimic how humans learn.)
In Table 1, the first question "Chuyển đổi số là gì" (What is digital
transformation?) and the second "Có thể nói rõ hơn về chuyển đổi số được không"
(Can you be more specific about digital transformation?) are in the group "Câu
hỏi chung về chuyển đổi số" (General questions about digital transformation).
When asking such questions, questioners are all looking to learn about the concept
of digital transformation. However, the first question can be answered in a
succinct way, while the second question needs a more detailed explanation, or the
questioner needs to be provided with a useful source of information that can be
consulted to better understand digital transformation. Questions 1 and 3 are two
questions in two different groups, "Câu hỏi chung về chuyển đổi số" and "Làm rõ
một số khái niệm liên quan đến chuyển đổi số" (Clarifying some concepts related
to digital transformation). These questions are clearly separated in terms of
semantics, as the first asks about the concept of digital transformation and the
second deals with the concept of machine learning.

The questions in the initial dataset also vary in length. Information
about the length of questions in this dataset is given in Table 2.
Table 2 Information about question length

Content | Quantity (syllables)
Average question length | 11
Shortest question length | 3
Longest question length | 42
25th percentile | 7
50th percentile | 10
75th percentile | 14
As shown in Table 2, the average question length is 11 syllables. The shortest
questions, like "NGSP là gì?" (What is NGSP?) and "LGSP là gì?" (What is
LGSP?), are quite simple, with only 3 syllables each. The longest question is up to
42 syllables in length. The questions in the dataset are mostly short and medium
questions: 25% of the questions are less than 7 syllables long, 50% are 10 syllables
or less, and 25% are longer than 14 syllables. Short questions are usually questions
about concepts, purposes, roles, or ways of doing a particular task. Long questions
can consist of many simple sentences put together or contain one or more complex
conceptual terms. Table 3 gives examples of some of the short and long questions
in the dataset.
Table 3 Some examples of short and long questions

Type: Short question
Question: Cổng dịch vụ công quốc gia là gì? (What is the National Public Service Portal?)
Answer: Cổng Dịch vụ công Quốc gia là cổng tích hợp thông tin về dịch vụ công trực tuyến, tình hình giải quyết, kết quả giải quyết thủ tục hành chính trên cơ sở kết nối, truy xuất dữ liệu từ các Hệ thống thông tin một cửa điện tử cấp bộ, cấp tỉnh và các giải pháp hỗ trợ nghiệp vụ, kỹ thuật do Văn phòng Chính phủ thống nhất xây dựng, quản lý. (The National Public Service Portal is a portal that integrates information about online public services, settlement situations, and results of administrative procedure settlement, on the basis of connecting to and retrieving data from ministry-level and provincial-level one-stop electronic information systems, together with professional and technical support solutions uniformly built and managed by the Government Office.)

Type: Short question
Question: Truy cập dữ liệu mở quốc gia ở đâu? (Where is National Open Data accessed?)
Answer: Có thể truy cập dữ liệu mở của cơ quan nhà nước tại Cổng dữ liệu quốc gia tại địa chỉ data.gov.vn. (Open data of state agencies can be accessed at the National Data Portal at data.gov.vn.)

Type: Short question
Question: Vai trò của kinh tế số là gì? (What is the role of the digital economy?)
Answer: Kinh tế số giúp tăng năng suất lao động, giúp tăng trưởng kinh tế. Kinh tế số cũng giúp tăng trưởng bền vững, tăng trưởng bao trùm, vì sử dụng tri thức nhiều hơn tài nguyên. Chi phí tham gia kinh tế số thấp hơn nên tạo ra cơ hội cho nhiều người hơn. (The digital economy helps increase labor productivity and helps economic growth. The digital economy also helps sustainable and inclusive growth, because it uses more knowledge than resources. The lower cost of participating in the digital economy creates opportunities for more people.)

Type: Long question
Question: Một trong những mục tiêu của Chương trình Chuyển đổi số quốc gia là tạo ra môi trường số nhân văn, rộng khắp. Mục tiêu này cần được hiểu như thế nào? (One of the goals of the National Digital Transformation Program is to create a humane, widespread digital environment. How should this goal be understood?)
Answer: Một trong những thế mạnh của các nền tảng số là khả năng mở rộng tiếp cận người dùng. Do vậy, người dân khắp mọi miền tổ quốc đều có thể được tiếp cận dịch vụ một cách bình đẳng. Người dân vùng sâu vùng xa, biên giới hải đảo vẫn có thể sử dụng dịch vụ y tế, giáo dục tốt nhất. Đó chính là ý nghĩa nhân văn của chuyển đổi số. (One of the strengths of digital platforms is scalability and user reach. Therefore, people in all parts of the country can have equal access to services. People in remote areas and on border islands can still use the best health and education services. That is the humane meaning of digital transformation.)

Type: Long question
Question: Hiện nay, đã có những doanh nghiệp Việt Nam nào đáp ứng được tiêu chí, chỉ tiêu kỹ thuật đánh giá, lựa chọn giải pháp nền tảng điện toán đám mây? (Currently, are there any Vietnamese enterprises that meet the criteria and technical targets for evaluating and selecting cloud computing platform solutions?)
Answer: Hiện nay đã có 05 doanh nghiệp Việt Nam, bao gồm: Viettel, VNG, CMC, VNPT và VCCorp đã đáp ứng theo bộ tiêu chí của Bộ Thông tin và Truyền thông. (Currently, there are 05 Vietnamese enterprises, including Viettel, VNG, CMC, VNPT, and VCCorp, which have met the criteria set by the Ministry of Information and Communications.)

Type: Long question
Question: Các chu trình lưu chuyển, xử lý thủ tục hành chính trên hệ thống thông tin một cửa điện tử cấp bộ, cấp tỉnh có thể chỉnh sửa linh hoạt để phù hợp với quy định của thủ tục hành chính không? (Can the circulation and handling of administrative procedures on the electronic one-stop information system at the ministerial and provincial levels be flexibly modified to conform to the regulations of administrative procedures?)
Answer: Có, đây là một yêu cầu bắt buộc được quy định trong Thông tư số 22/2019/TT-BTTTT của Bộ Thông tin và Truyền thông. Theo đó, các chu trình lưu chuyển, xử lý thủ tục hành chính trên hệ thống thông tin một cửa cấp bộ, cấp tỉnh phải cho phép điều chỉnh động, linh hoạt trong việc định nghĩa quy trình, thủ tục. (Yes, this is a mandatory requirement specified in Circular No. 22/2019/TT-BTTTT of the Ministry of Information and Communications. Accordingly, the circulation and handling of administrative procedures on the one-stop information system at the ministerial and provincial levels must allow dynamic and flexible adjustment in the definition of processes and procedures.)
24
In addition to the questions that are highly specific to the field of Digital
Transformation, this dataset also contains questions with proper names, acronyms,
and English words like the examples given in Table 4.
Table 4 Special questions

Questions containing proper names:

Question: "Make in Việt Nam là gì?" (What is Make in Vietnam?)
Answer: "Make in Việt Nam là định hướng chuyển từ gia công lắp ráp sang sáng tạo tại Việt Nam, thiết kế tại Việt Nam và làm ra tại Việt Nam. Tỷ trọng Make in Việt Nam từ hiện tại đang là 22% sang đạt trên 45% vào năm 2025." (Make in Vietnam is an orientation to shift from outsourcing and assembly to creating in Vietnam, designing in Vietnam, and making in Vietnam. The Make in Vietnam proportion is to grow from the current 22% to over 45% by 2025.)

Question: "Mạng IPv6 thuần là gì?" (What is a pure IPv6 network?)
Answer: "Mạng IPv6 thuần là mạng trong đó các thiết bị chỉ giao tiếp với nhau qua giao thức hỗ trợ IPv6 mà không cần thực hiện chuyển đổi sang IPv4." (A pure IPv6 network is a network in which devices communicate with each other only over the IPv6 protocol, without converting to IPv4.)

Questions containing acronyms:

Question: "NGSP là gì?" (What is NGSP?)
Answer: "NGSP là từ viết tắt của National Government Service Platform, nghĩa tương đương với Nền tảng tích hợp, chia sẻ dữ liệu quốc gia." (NGSP is an acronym for National Government Service Platform, equivalent to the National Data Sharing and Integration Platform.)

Question: "LGSP là gì?" (What is LGSP?)
Answer: "LGSP là từ viết tắt của Local Government Service Platform, nghĩa tương đương với Nền tảng tích hợp, chia sẻ dữ liệu cấp Bộ, ngành, địa phương." (LGSP is an acronym for Local Government Service Platform, equivalent to the Ministry-, Sector-, and Local-level Data Sharing and Integration Platform.)

Questions containing English words:

Question: "Mobile money là gì?" (What is mobile money?)
Answer: "Là mô hình thí điểm cho phép dùng tài khoản viễn thông thanh toán cho các hàng hóa, dịch vụ có giá trị nhỏ. Mobile money đã được Thủ tướng phê duyệt tại Quyết định số 316/QĐ-TTg ngày 09/3/2021." (It is a pilot model that allows telecommunications accounts to be used to pay for small-value goods and services. Mobile money was approved by the Prime Minister in Decision No. 316/QD-TTg dated March 9, 2021.)

Question: "Đào tạo nâng cấp kỹ năng số (up-skill) là gì?" (What is digital up-skilling?)
Answer: "Đào tạo nâng cấp kỹ năng số là quá trình đào tạo trang bị kiến thức và mở rộng các kỹ năng hiện có của người lao động để đáp ứng nhu cầu của công việc." (Digital up-skilling is a training process that equips workers with knowledge and expands their existing skills to meet the needs of the job.)
ASR models are often trained on general-domain data. With proper names, acronyms, and English words such as those in Table 4, these models may fail to recognize the content correctly, making it difficult to identify the question and pass it to the QA model. This is therefore another challenging point for data construction and for solving the problem with speech input.
Overall process
A high-quality dataset is essential to the performance of AI models. Although the initial dataset can be regarded as the knowledge base for the QA problem, with a single example per question it is impossible for the computer to understand the user's question and give the answer. Therefore, it is necessary to build a training dataset so that the computer can perform that task. Based on the idea of similar questions, where two questions are considered similar when they have the same answer, we propose a process to build training data from the initial dataset. The details of this process are illustrated in Figure 11.
Figure 11 Data building process.
First, the initial dataset is normalized and fed into two data-building processes: the training data-building process and the test data-building process. The former creates learning data for the QA model, while the latter builds a dataset to evaluate the quality of the trained model. Based on the concept of similar questions, the training data-building process uses a Written Collection System that asks participants to rewrite the original question in an equivalent form. Besides, a Speech Collection System is used in the test data-building process, asking the participant to provide a question similar to the one given and recording it. Data provided by contributors is re-checked during and after collection to avoid duplication and to detect ambiguity between contributed questions. Finally, the data is manually evaluated by professionals to ensure its quality.
The next section of the thesis will discuss in more detail the use of the
aforementioned data building processes.
Training data building process
As mentioned in part 2.2.2, the Written Collection System is used for the
purpose of collecting questions that are similar to the questions in the initial
dataset. For each data collection turn, the collaborator contributes according to the
process described in Figure 12.
Figure 12 Written data collection process.
From the initial dataset, contributors provide questions that are similar to the existing questions according to the process shown in Figure 12. The process consists of three main steps: (i) question selection, (ii) data collection, and (iii) duplicate checking.
During the question selection step, the system presents the contributor with a list of questions to collect. To diversify the contributed data, in each collection turn the system displays only 5 questions. Initially, when no data has been contributed, the questions are randomly selected from the database. Once the system starts receiving contributions, the original questions will each have a different number of similar questions. At this point, the system sorts the questions in ascending order of their number of similar questions and then randomly selects 5 questions from the 20 with the fewest similar questions. This keeps the selection random while ensuring the balance of the data.
After the system has selected the questions to be collected, the process moves to the data collection step, in which contributors write questions similar to the ones given and send them back to the system. Figure 13 shows the data collection interface for each turn, with the 5 sentences selected by the system for participants to contribute.
Figure 13 The data collection interface.
The number of questions to collect is relatively large: for example, if the goal is to collect 10 similar questions for each original question, up to 1,950 questions must be collected. During collection, contributors may enter sentences that already exist in the system, whether contributed by someone else or entered earlier by themselves. To limit data duplication, which would reduce the diversity of the training data, the duplicate checking step detects sentences provided by contributors that are already in the system. In this step, the collection system performs data processing steps including removing redundant spaces and the special characters [!\"#$%&\'()*+,/:;<=>?[\]^`{|}~], converting the text to lowercase, and normalizing tone-mark placement ("òa" → "oà", "óa" → "oá", "ỏa" → "oả", "õa" → "oã", "ọa" → "oạ", "òe" → "oè", "óe" → "oé", "ỏe" → "oẻ", "õe" → "oẽ", "ọe" → "oẹ", "ùy" → "uỳ", "úy" → "uý", "ủy" → "uỷ", "ũy" → "uỹ", "ụy" → "uỵ"). The system then checks whether the provided question is among the existing questions. If the question already exists, the system alerts the collector and asks them to edit it or provide another sentence. Otherwise, the system saves the data, ends the collection turn, and moves on to the next one.
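The following sketch illustrates this normalization and duplicate check, assuming the tone-mark mapping above and an in-memory set of already-normalized questions (the function names are illustrative):

import re
import unicodedata

# Old-style to new-style tone-mark placement, as listed above.
ACCENT_MAP = {"òa": "oà", "óa": "oá", "ỏa": "oả", "õa": "oã", "ọa": "oạ",
              "òe": "oè", "óe": "oé", "ỏe": "oẻ", "õe": "oẽ", "ọe": "oẹ",
              "ùy": "uỳ", "úy": "uý", "ủy": "uỷ", "ũy": "uỹ", "ụy": "uỵ"}

SPECIAL = re.compile(r"[!\"#$%&'()*+,/:;<=>?\[\]^`{|}~]")

def normalize(text: str) -> str:
    """Normalize a question before the duplicate check."""
    text = unicodedata.normalize("NFC", text)      # compose Unicode accents
    text = SPECIAL.sub(" ", text).lower()          # drop special chars, lowercase
    for old, new in ACCENT_MAP.items():
        text = text.replace(old, new)              # unify tone-mark placement
    return re.sub(r"\s+", " ", text).strip()       # collapse redundant spaces

def is_duplicate(candidate: str, existing_normalized: set) -> bool:
    return normalize(candidate) in existing_normalized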
The collected data will be evaluated through a process shown in Figure 14.
Figure 14 Training data evaluation process.
After collecting data through the Written Collection System, the evaluation process is an indispensable step to ensure data quality. This is especially important when participants provide multiple questions at once, because they may mistakenly type a question that is actually similar to a different original question. This creates ambiguity and makes the data difficult to process and use. To solve this problem, it is necessary to develop methods and tools to detect such errors in the collected data. One possible approach is to apply the K-fold technique to the obtained data. Specifically, each initial question together with its similar questions forms a class label. The K-fold technique then randomly divides the dataset into k groups, each group containing all the class labels. We choose k-1 groups to train a text classifier and use it to predict class labels for the data in the remaining group. If an item is predicted to have a class label different from its assigned label, it may have been mistyped, and the evaluator should check these cases. This procedure is performed on all groups. In the thesis, we use this technique with k=5 and Random Forest as the classification model. The next important step is the checking and re-evaluation of the data by experts. Experts with in-depth knowledge of the relevant field ensure that the contributed questions are similar to the original questions and that there is no confusion or miscommunication. They can compare questions, check their accuracy, and make adjustments or suggest necessary modifications. Through this process, the data becomes clearer and of better quality, ensuring its reliability and accuracy. This is extremely important when using the data for analysis or for training AI models, as inaccurate or conflicting data can lead to unreliable results or misleading conclusions.
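A compact sketch of this check, using scikit-learn's cross-validated predictions with stratified 5-fold splitting and a Random Forest over TF-IDF features; the exact implementation in the thesis may differ:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_predict, StratifiedKFold

def flag_suspect_questions(texts, labels):
    """Return indices whose out-of-fold prediction disagrees with the assigned label."""
    X = TfidfVectorizer().fit_transform(texts)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    predicted = cross_val_predict(RandomForestClassifier(), X, labels, cv=cv)
    return [i for i, (p, y) in enumerate(zip(predicted, labels)) if p != y]

The flagged indices are not automatically discarded; they are handed to the human evaluators described above as candidates for mistyped or ambiguous contributions.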
Test data building process
Although it provides a quick way to construct data, data collected through the Written Collection System is influenced by the writing style of the collaborators. Meanwhile, when communicating by speech, the way we express content differs from the way we write. For example, when speaking, we may add filler words such as "à", "ừ" (uh, yeah) or connectives such as "thì" (then), "là" (that), and the sentence structure may not be as complete as in writing. Therefore, in addition to the data collected through the Written Collection System, the process provides a second form of collection through speech. In this form, participants are asked to provide questions similar to the root questions by recording themselves. The data collection process using this system is depicted in Figure 15.
Figure 15 Speech data collection process.
Similar to the written data collection process, the speech data collection
process also requires participants to provide similar questions to the initial dataset.
This process consists of five main steps: (i) question selection, (ii) data collection,
(iii) transcription generation, (iv) checking duplication, and finally (v) content
editing.
In this process, the question selection step is also performed to randomly choose questions to be collected. However, unlike the written data collection process described above, because each turn here is more complicated, the system displays only one question to be collected at a time. This question is likewise randomly selected from the 20 questions with the fewest contributed sentences.
The data contributor will then read the question provided by the system and
ask a similar question. Once the question has been prepared, the contributor
proceeds to record the question as shown in Figure 16.
Figure 16 Speech data collection interface.
After the question is recorded, the system moves to the transcript generation step, in which an ASR module converts the recorded content into text. The system then checks for duplicates against the original question to ensure the diversity of the dataset, using the same duplicate check as in the Written Collection System. If the question is a duplicate, the contributor needs to re-record it; otherwise, the system moves to the next step. The transcript also spares contributors from re-entering what they just said from scratch: they only need to edit the transcript. Moreover, since speech input to the QA model must pass through an intermediate ASR model, the transcript generated here is also the data used to evaluate the effect of the ASR module on the quality of the QA model when speech is used as input. If the provided transcript does not match what was said, the contributor corrects and saves it. At this point, a data collection turn ends.
The process of building data in speech form has some characteristics that differ from collecting data in written form. In each collection turn, collaborators are required to record only one question similar to the original question, which reduces the possibility of ambiguity compared to written data collection. Each similar question collected in this process is recorded and identified by both its spoken content and its transcription. The test data-building process thus creates two important types of test data. The first is the transcription of the recorded voice, i.e., the audio-to-text conversion, which provides a good way to test the influence of the ASR module on the QA model. The second is the content actually said by the participant, which is a question similar to the original one. This data provides important information for assessing the similarity between questions and ensuring the accuracy and reliability of the data. For this test data, evaluation relies on expert review alone. Experts with in-depth knowledge of the relevant field test and re-evaluate both transcription and content to determine the accuracy, consistency, and similarity of the questions. They can compare the similar questions with the original question and make comments, adjustments, or suggestions for the subsequent collection turns. This evaluation process helps ensure that the collected data is accurate, consistent, and similar to the original questions, thereby producing high-quality test datasets to evaluate the QA model's capabilities in real-world conditions with speech input.
2.3 Data analysis
Data collection campaigns
Following the process described in section 2.2, we carried out two data
collection campaigns, using a combination of both previously mentioned
collection systems. Information about data collection campaigns is given in Table
5.
Table 5 Data collection campaigns information

                                        Campaign 1    Campaign 2
  Number of root questions              175           194
  Number of collaborators               27            32
  Male/Female                           15/12         15/17
  Age                                   19-23         19-22
  Northern/Central/Southern Vietnam     12/6/9        12/10/12
The data collection proceeded through two campaigns. The first campaign used 175 question-answer pairs as the base dataset, while the second campaign expanded this to 194 pairs of questions and answers. These pairs were all provided by experts in the field of Digital Transformation at the Ministry of Information and Communications. The collaborators who contributed the data were all between 19 and 23 years old. To ensure data diversity, especially when recording speech from different regions, collaborators were selected so that gender and region were represented in roughly equal proportions. Specifically, in the first campaign, a total of 27 collaborators participated, including 15 men and 12 women; among them, 12 collaborators speak with a Northern accent, 6 with a Central accent, and 9 with a Southern accent. In the second campaign, the number of collaborators increased to 32, including 15 men and 17 women. The regional proportions in the second campaign are also more evenly distributed, with 12 collaborators speaking with a Northern accent, 10 with a Central accent, and 12 with a Southern accent. Once collected, the data is checked and evaluated by data experts before the QA models are trained and deployed.
Data collection results
Through 02 campaigns, the dataset for the QA problem was built with the
information described in Table 6.
Table 6 Information about data collected through campaigns

                                Written Collection System        Speech Collection System
                                Camp. 1   Camp. 2   Total        Camp. 1   Camp. 2   Total
  Total questions collected     2,153     3,646     5,799        1,100     1,819     2,909
  Number of root questions      175       194       194          175       194       194
  Average sentences/root q.     12        18        29           6         9         14
  Minimum sentences/root q.     8         1         14           2         4         5
  Maximum sentences/root q.     25        38        56           15        14        29
Through the two data collection campaigns, a total of 5,799 similar questions were built for 194 original questions, an average of 29 similar questions per original question; each original question has at least 14 and at most 56 similar questions. With the speech data collection system, for the same 194 original questions, a total of 2,909 samples were collected, each including audio and text. Each original question has an average of 14 collected questions, with a minimum of 5 and a maximum of 29.
Regarding the number of words in a sentence, Figure 17 shows the
distribution of collected data by length with the Written Collection System.
Figure 17 Distribution of the number of words in a sentence
with the Written Collection System.
As depicted in Figure 17, sentence lengths range from 3 to 47 words; the data is concentrated in the region from 6 to 18 words, with the largest amount around 10 words. Very short and very long sentences are relatively few. This is consistent with the fact that 75% of the original questions are 14 syllables or less in length.
With the data collected through the Speech Collection System, the distribution of data length is illustrated in Figure 18.
Figure 18 Distribution of word count in the data collected by the speech system.
Similar to the written system, the sentences collected by the speech system are from 3 to 48 words long, concentrated in the range of 6 to 20 words. Compared with the sentences re-labeled by the data contributors, the transcribed sentences differ slightly, owing to the influence of the ASR module on what the participant actually said. In general, the distribution of this data is similar to that of the data collected by the written system.
2.4 Data Disclosure
Through the two data collection systems, the thesis has built a dataset for building and deploying the QA system. The dataset includes:
(i) training data, built from the similar questions contributed through the written system,
(ii) test data with the correct speaker content, built by labeling the audio content during recording, called the User test,
and (iii) test data with the transcribed content generated by the ASR module during recording, called the ASR test.
The size of the dataset, presented in Table 6, comprises 5,799 similar questions and 194 original questions used in training, and 2,909 samples in each test dataset. All data is saved as JSON files in UTF-8 encoding.
To the best of the author's knowledge, this is the first published dataset for a QA problem in the Digital Transformation domain. Besides using this dataset in question answering studies, the data construction process can be extended to other data domains beyond Digital Transformation.
CHAPTER 3. VIETNAMESE QUESTION ANSWERING MODEL,
EXPERIMENT AND EVALUATION
In CHAPTER 2, the thesis presented a process for building training and testing data for the QA problem, specifically in the field of Digital Transformation. Next, this chapter presents QA models for Vietnamese and experiments to evaluate those models on the constructed data.
3.1 Vietnamese Question Answering problem
The goal of the QA model is to provide an answer to a user's input question. In fact, given the richness of language, each person can express the same question in many different ways. For example, to ask about the concept of digital transformation, we have many expressions such as "Chuyển đổi số là gì?" (What is Digital Transformation?) or "Bạn có thể nói cho tôi về khái niệm của Chuyển đổi số được không?" (Can you tell me the concept of Digital Transformation?). Since they have the same purpose, these questions can be answered in the same way. Therefore, the QA problem can be considered either as a text classification task, in which the QA model groups the user's utterance into the class of the closest question, or as a similarity comparison task, in which the model finds the question with the highest similarity to the user's utterance. Following this analysis, the thesis researches and experiments with text classification models and similarity comparison models on the built data.
Thus, we can state the QA problem as follows: given a set of questions Q = {q_1, q_2, ..., q_m}, each question q_i of Q is answered by an answer a_i of a set A. Each question q_i also has a set of questions with similar meaning, known as similar questions S_i = {s_i1, s_i2, ..., s_ik_i}, where k_i is the number of questions similar to q_i. For a question q', we need to give the corresponding answer a'. Since the questions in Q and the answers in A are linked one-to-one, the task of the problem becomes finding the question q_i that is closest in meaning to q'.
In the next part, the thesis presents how to solve the QA problem in two directions: text classification and comparing the similarity between two questions.
Text classification problem
In the text classification direction, the data of the QA problem is divided into classes, where each class includes an original question q_i and its similar questions S_i = {s_i1, s_i2, ..., s_ik_i}. In the built dataset, if we have m original questions then we also have m corresponding class labels L = {l_1, l_2, ..., l_m}. As such, each class relates to a single root question, and each root question relates to a unique answer. Therefore, to find the answer to a new question, the task of the QA problem is to find the class label for that question. Once this class label is found, the answer corresponding to the original question in that class is the answer to the new question. From the built dataset, we train a text classification model, which is then used to predict the label l' for a new question q'. The architecture of the classification model is depicted in Figure 19.
The architecture of the classification model is depicted in Figure 19.
Figure 19 Text classification model architecture.
First, the input question goes through the data preprocessing step, where the model cleans the data to avoid disturbing the classification model. Preprocessing includes removing special characters, removing extra spaces, normalizing punctuation, etc. Because computers cannot operate directly on character data, the next step is feature extraction, which selects attributes of the data and represents it as feature vectors. TF-IDF, CBOW, Word2Vec, etc. are effective feature extraction methods for natural language processing problems. From these vectors, a classification model such as Naïve Bayes, SVM, or LSTM learns to generalize over the input data and the class labels. Finally, a probability vector is produced, representing the model's guess; the predicted class label is the one with the highest probability.
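The following sketch wires these steps together with scikit-learn, using TF-IDF features and a linear SVM as one possible instantiation of Figure 19; the toy data and the answer lookup are illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data: each class label corresponds to one root question and one answer.
train_questions = ["Chuyển đổi số là gì?", "Khái niệm chuyển đổi số?",
                   "NGSP là gì?", "NGSP nghĩa là gì?"]
train_labels = [0, 0, 1, 1]
answers_by_label = {0: "<answer about digital transformation>", 1: "<answer about NGSP>"}

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),       # feature extraction
    ("svm", SVC(kernel="linear")),      # classifier predicting the class label
])
pipeline.fit(train_questions, train_labels)

query = "Bạn có thể nói về khái niệm chuyển đổi số không?"
print(answers_by_label[pipeline.predict([query])[0]])  # answer of the predicted class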
Similarity questions problem
In the similar questions direction, for each question q', the task is to find the question most similar to q', based on the built data. The pre-trained model Siamese Bert [27] (Sbert) is proposed to solve this problem. The architecture of the model is illustrated in Figure 20.
Figure 20 Similarity comparison model Sbert.
This model is inspired by the Siamese neural network, a structure that uses two identical neural networks. Its purpose is to measure how close two sentences are through cosine similarity. As each question passes through a BERT network, we obtain a vector representing its semantics. This vector then passes through a pooling layer, producing the vectors u and v that represent the two questions. Finally, these two vectors are used to calculate the cosine similarity: the larger the similarity, the closer the semantics of the two sentences. This value ranges from 0 to 1, where 1 means the two sentences are completely similar and 0 means they are not similar at all.
As in the text classification approach, we divide the built dataset into classes, each consisting of an original question q_i and its similar questions S_i. To build the training data for the SBert model, two consecutive questions in a class form a pair of similar data and are labeled 1. At the same time, for each question in a class, we randomly select questions q_j (j ≠ i) from other classes to form pairs of dissimilar questions, each labeled 0.
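A small sketch of this pair construction, assuming the classes are held as a mapping from class id to its list of questions; the names and the number of negatives per question are illustrative:

import random

def build_sbert_pairs(classes, negatives_per_question=1, seed=0):
    """`classes` maps a class id to its list of questions (root + similar ones).

    Consecutive questions in a class form positive pairs (label 1); for each
    question, random questions from other classes form negative pairs (label 0).
    """
    rng = random.Random(seed)
    pairs = []
    ids = list(classes)
    for cid, questions in classes.items():
        for a, b in zip(questions, questions[1:]):
            pairs.append((a, b, 1.0))                               # similar pair
        for q in questions:
            for _ in range(negatives_per_question):
                other = rng.choice([c for c in ids if c != cid])
                pairs.append((q, rng.choice(classes[other]), 0.0))  # dissimilar pair
    return pairs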
For a new question q', q' is compared with all the original questions q_i (i = 1, ..., m) in the original dataset through the fine-tuned Sbert model. Sbert calculates and returns a similarity value s'_i for each pair (q', q_i). The answer of the original question with the highest similarity to q' is then taken as the answer to q'.
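A sketch of this inference step using the sentence-transformers library; the model path is a placeholder for the fine-tuned PhoBert-based SBert, which is not part of the library itself:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("path/to/fine-tuned-sbert")  # placeholder path

def answer(query, root_questions, answers):
    """Return the answer of the root question most similar to `query`."""
    q_emb = model.encode(query, convert_to_tensor=True)
    r_emb = model.encode(root_questions, convert_to_tensor=True)
    best = util.cos_sim(q_emb, r_emb).argmax().item()  # index of highest similarity
    return answers[best]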
In the text classification approach, each original question corresponds to a class label; in the similarity approach, each new question is compared to find the most similar original question. Therefore, to compare the two approaches easily, for each question in the test set we produce a predicted class label: with the first approach it comes directly from the classification model, and with the second it is derived indirectly from the label of the most similar original question. Finally, we compare the predicted class label with the question's actual class label to evaluate the effectiveness of the experimented QA models.
3.2 Experiment setup
Experiment models
In this thesis, the author experiments with and evaluates the QA task using text classification models and a similar-question comparison model. Specifically, the following models are included:
(i) a Random Forest text classification model, tuned with GridSearch over the number of trees (100, 200, 500, and 1000) and the split quality criterion (gini and entropy); the feature extraction method is TF-IDF;
(ii) an SVM model, tuned with GridSearch over the regularization parameter C (1, 2, 5, 10), with kernel coefficient 0.1 and a linear kernel; this experiment also uses TF-IDF feature extraction (a GridSearch sketch follows this list);
(iii) a text classification model using an LSTM recurrent network, with Word2Vec feature extraction and the categorical cross-entropy loss function (hyperparameters in Table 7);
(iv) a text classification model using the pre-trained language model PhoBert [55] combined with the K-fold technique;
(v) the similar-question model SBert with the pre-trained language model PhoBert [55].
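A GridSearch sketch for the SVM tuning described in item (ii), using the parameter values above; the toy data and the small fold count are illustrative, since the thesis does not state the number of cross-validation folds:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy stand-in data; in the thesis, the inputs are the contributed similar questions.
texts = ["chuyển đổi số là gì", "khái niệm chuyển đổi số", "ngsp là gì", "ngsp nghĩa là gì"]
labels = [0, 0, 1, 1]

X = TfidfVectorizer().fit_transform(texts)
search = GridSearchCV(
    SVC(),
    {"C": [1, 2, 5, 10], "kernel": ["linear"], "gamma": [0.1]},
    scoring="f1_macro",
    cv=2,                        # small fold count only because the toy set is tiny
)
search.fit(X, labels)
print(search.best_params_)       # best parameter combination found by the search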
All of the above are effective models for text classification and natural language understanding. SVM is capable of finding good decision boundaries between different text classes. Random Forest can handle sparse, discontinuous text features while limiting overfitting. LSTM is a recurrent neural network architecture suited to sequential data such as text: it can remember past information and capture complex patterns, which makes it especially useful for natural language modeling.
BERT is capable of understanding context and representing the meaning of each word in a sentence, helping to capture complex information and produce high-quality text representations. The hyperparameters and configurations of the QA models used in the experiments are presented in Table 7.
Table 7 Model hyperparameters

  Random Forest
    Number of decision trees                100, 200, 500, 1000
    Split quality criterion                 gini, entropy

  SVM
    Regularization parameter                1, 2, 5, 10
    Kernel function                         linear
    Kernel coefficient                      0.1

  LSTM
    Embedding output dimension              400
    LSTM output dimension                   128
    LSTM activation function                tanh
    LSTM recurrent activation function      sigmoid
    Classification activation function      softmax
    Optimizer                               Adam
    Loss function                           Cross-entropy
    Epochs                                  50

  Pre-trained model PhoBert
    Attention probs dropout prob            0.1
    Hidden dropout prob                     0.1
    Hidden size                             768
    Layer norm eps                          1e-05
    Max position embeddings                 258
    Model type                              Roberta
    Number of attention heads               12
    Number of hidden layers                 12
    Pad token id                            1
    Type vocab size                         1
    Vocab size                              64,001

  PhoBert classification model (includes pre-trained PhoBert)
    Classification dropout                  0.3
    Optimizer                               Adam
    Learning rate                           2e-5
    Loss function                           Cross-entropy
    Epochs                                  6

  SBert (includes pre-trained PhoBert)
    Optimizer                               Adam
    Learning rate                           5e-6
    Loss function                           MSE
    Epochs                                  7
All these experiments were performed in the Google Colab environment, with the Ubuntu 18.04 LTS operating system, 13 GB of RAM, and 80 GB of disk space. Google Colab also provides an experiment environment with Python 3.6 and a Tesla T4 GPU.
Evaluation Criteria
To evaluate the experimental models, the author uses the F1 score [45]. This score is often used in text classification, where the formula for F1 is generalized from the two-class case to the multi-class case. The score is used because, in a two-class classification problem, one class is often more important than the other. For example, in spam classification, wrongly predicting an important message as spam has greater consequences than misclassifying spam as a regular email, because the user would miss important information. The more important class is called the positive class; the other is called the negative class. From these two classes, we obtain the confusion matrix of True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) values shown in Table 8.
Table 8 Confusion matrix

                            Prediction: Positive     Prediction: Negative
  Ground truth: Positive    True Positive (TP)       False Negative (FN)
  Ground truth: Negative    False Positive (FP)      True Negative (TN)
Based on these values, we calculate the Precision and Recall scores: a high Precision means that the points predicted as positive are very likely to be truly positive, and a high Recall means that few truly positive points are missed. Precision and Recall are computed according to Equation 3.1:

    Precision = TP / (TP + FP)
    Recall = TP / (TP + FN)                                      (Eq. 3.1)

The higher the Precision, the more accurate the positive predictions are, but Precision does not reflect how many truly positive points were missed. Conversely, Recall shows how few positives were missed, but not how many of the positive predictions were correct. Therefore, the F1 score was introduced to combine the two concerns, as the harmonic mean of Precision and Recall according to Equation 3.2:

    F1 = 2 × Precision × Recall / (Precision + Recall)           (Eq. 3.2)
The higher the F1 value, the more efficient the classification model, because both Precision and Recall must then be good. F1 is a commonly used score for text classification models; for the problem of comparing the similarity of two questions, the author also uses F1 by treating each original question as a class label. When a question to be predicted is judged similar to a question in the database, we consider the model to have predicted the class label corresponding to that original question.
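With both approaches reduced to predicted class labels, the multi-class F1 can be computed directly, for example with scikit-learn's macro-averaged F1; the labels below are toy values:

from sklearn.metrics import f1_score

# True and predicted class labels for the test questions, from either approach.
y_true = [0, 0, 1, 2, 2]
y_pred = [0, 1, 1, 2, 2]
print(f1_score(y_true, y_pred, average="macro"))  # multi-class F1 as in Eq. 3.2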
3.3 Results and evaluations
The results of the evaluation of the experimental QA models are shown in Table 9. All results are measured with the F1 score.
Table 9 Evaluation results of experimented models

  Model           User test    ASR test    Difference between the two test sets
  Random Forest   98.3%        94.4%       3.9%
  SVM             98.8%        94.8%       4.0%
  LSTM            97.1%        92.2%       4.9%
  PhoBert         96.0%        92.0%       4.0%
  SBert           92.0%        82.0%       10.0%
The evaluation results in Table 9 show that the models achieve promising results on the test sets. Among them, the SVM model achieved the best results on both the User test and ASR test sets, with 98.8% and 94.8%, respectively. Classification models using deep neural networks also give very good results, with 92.0% on the ASR test set for PhoBert and 92.2% for LSTM. The Random Forest model is also close to SVM on both test sets, reaching 98.3% on the User test set and 94.4% on the ASR test set. Although SBert demonstrated good results on the User test set with 92%, the model was quite sensitive to the output of the ASR module, achieving only 82% on the ASR test set.
Table 9 also shows the influence of the ASR module on the results of the QA models. When evaluated on the dataset of ASR output transcriptions, all models give lower results than on the content the users actually said. The difference ranges from 3.9% to 10%: the Random Forest model has the smallest gap between the two test sets, 3.9%, while the SBert model shows the largest, up to 10%. The evaluation on transcript data gives system implementers a basis to assess the effectiveness of models under the influence of the ASR module and to select the appropriate model.
Based on Table 9, SVM achieves very positive results in understanding user questions. This can be explained by analyzing the dataset. First, most of the questions focus on specific keywords and relate to concepts and content in the field of Digital Transformation. SVM, a machine learning classifier, usually performs well on classification problems with relatively simple and easily separable features; with clear keywords and concepts in the Digital Transformation field, SVM therefore yields positive results. Meanwhile, the BERT-based model, built on a deep neural network, does not outperform SVM in this case. The main reason is that terminology specific to the Digital Transformation field may not be generalized well during tokenization or feature extraction, leading to the loss of important information during data processing. In particular, when the user's utterance is distorted by the ASR module, this information loss increases.
Besides prediction quality, prediction time and model size are also important factors in choosing the right model for practical deployment. A model with a short prediction time lets the system respond to the user quickly, optimizing the user experience, so a model with fast prediction is preferable. At the same time, to serve millions of users, the system needs to be scalable, and the smaller the model, the easier it is to scale. Therefore, in addition to the quality of the experimented models, the thesis also considers their average prediction time and size. Details are given in Table 10.
Table 10 Size and average prediction time

  Model           Size (MB)    Average prediction time (s)
  Random Forest   685.8        0.114
  SVM             36           0.02
  LSTM            9.8          0.08
  PhoBert         515.6        0.02
  SBert           515.1        0.05
The evaluation results in Table 10 show that the SVM and PhoBert models have the same average prediction time and are the fastest, at only 0.02s per query. However, PhoBert has a rather heavy model size, up to 515.6MB, due to its complex architecture. LSTM and SVM are both lightweight models at 9.8MB and 36MB respectively, but LSTM has a longer prediction time than SVM, 0.08s versus 0.02s. The SBert model has a relatively fast prediction time of 0.05s, but its size is quite large at 515.1MB.
The prediction time and size of a model are proportional to the number of parameters it uses. SVM is a linear classifier based on finding the best linear boundary separating the data points of different classes in feature space. It does not require many parameters compared with deep learning models such as BERT: the number of parameters in an SVM depends mainly on the number of features and classes, and less on the data size, so SVM usually has far fewer parameters than BERT. In contrast, BERT is a deep neural network with a complex architecture and a very large number of parameters, in the millions. BERT uses many layers of neurons and a large number of weights to learn complex semantic patterns in linguistic data, which is why it requires more computational resources and memory to train and deploy than SVM. Based on the results in Table 9 and Table 10, SVM is a suitable model for deploying the question answering problem in practice, with an F1 of 94.8% on the ASR test set, a fast prediction time of only 0.02s, and a light model size of 36MB.
Thus, the thesis has presented experiments and evaluations of QA models with Vietnamese data in the built Digital Transformation domain. The evaluation results give researchers, as well as organizations and individuals wishing to build a QA system, a basis for choosing an appropriate model when deploying the system in practice.
CONCLUSION AND FUTURE WORKS
1. Conclusion
In this study, with the goal of solving the Vietnamese Question Answering problem while taking speech input into account, the thesis has published a Vietnamese question-and-answer dataset and implemented (1) a proposed data-building process and (2) experiments evaluating QA models on the built data.
The proposed process is based on the concept of similar questions, i.e., questions that can be answered in the same way. Accordingly, the process consists of two steps: building training data and building test data. For the training data, a Written Collection System is used to collect questions similar to the original questions, and the data is then evaluated through ambiguity analysis combined with manual evaluation. The test data comprises transcripts and speaker content, collected with a Speech Collection System and evaluated manually.
Besides, the applicability of the data is evaluated through the QA models. Based on the built data, the thesis experiments with QA models in two main directions: text classification and comparing the similarity between questions. The models all achieve promising results, with F1 scores of 82-94.8% on the ASR test set; among them, the SVM model has the highest accuracy, and it is also light (36MB) with a fast prediction time of 0.02s, making it suitable for practical deployment.
2. Thesis’s contributions
Through the preceding chapters, the thesis's contributions include:
(i) proposing a process for building combined text and audio data for the question-answering problem. This process provides a guide outlining the steps taken to collect, process, and evaluate data for a question-answering problem with speech input, reducing costs by building the two types of data together and making data construction more controlled and systematic;
(ii) publishing the QA dataset for transparency and reuse in the research community. Future studies can use the dataset to develop and evaluate their own QA models;
(iii) experimenting with different QA models on the built dataset, providing information on the effectiveness and accuracy of the evaluated models and giving further studies a basis for comparing and selecting appropriate methods.
3. Future works
Despite the promising results, the study still has several limitations. Data collection is limited to the initial questions, so the model cannot yet answer questions outside that dataset. At the same time, data collection and evaluation are manual processes, which increases the cost of building a QA system. In the future, the thesis aims to generate original data from available documents and, at the same time, to partially automate evaluation and provide assessments that support data testers, in order to minimize the costs of the data construction process.
REFERENCES
[1] F. Zhu, W. Lei, C. Wang, J. Zheng, S. Poria, and T.-S. Chua, “Retrieving and
Reading: A Comprehensive Survey on Open-domain Question Answering.”
arXiv, May 08, 2021. Accessed: Apr. 02, 2023. [Online]. Available:
http://arxiv.org/abs/2101.00774
[2] Z. Huang et al., “Recent Trends in Deep Learning Based Open-Domain
Textual Question Answering Systems,” IEEE Access, vol. 8, pp. 94341–94356, 2020, doi: 10.1109/ACCESS.2020.2988903.
[3] Q. Jin et al., “Biomedical Question Answering: A Survey of Approaches and
Challenges.” arXiv, Sep. 08, 2021. doi: 10.48550/arXiv.2102.05281.
[4] D. Moldovan, M. Pasca, S. Harabagiu, and M. Surdeanu, “Performance Issues
and Error Analysis in an Open-Domain Question Answering System,” in
Proceedings of the 40th Annual Meeting of the Association for Computational
Linguistics, Philadelphia, Pennsylvania, USA: Association for Computational
Linguistics, Jul. 2002, pp. 33–40. doi: 10.3115/1073083.1073091.
[5] R. Usbeck, A.-C. N. Ngomo, L. Bühmann, and C. Unger, “Hawk–hybrid
question answering using linked data,” in The Semantic Web. Latest Advances
and New Domains: 12th European Semantic Web Conference, ESWC 2015,
Portoroz, Slovenia, May 31–June 4, 2015. Proceedings 12, Springer, 2015, pp. 353–368.
[6] C. Kwok, O. Etzioni, and D. S. Weld, “Scaling Question Answering to the
Web”.
[7] J. Kupiec, “MURAX: a robust linguistic approach for question answering
using an on-line encyclopedia,” in Proceedings of the 16th annual
international ACM SIGIR conference on Research and development in
information retrieval - SIGIR ’93, Pittsburgh, Pennsylvania, United States:
ACM Press, 1993, pp. 181–190. doi: 10.1145/160688.160717.
[8] Z. Zheng, “AnswerBus question answering system,” in Proceedings of the
second international conference on Human Language Technology Research
-, San Diego, California: Association for Computational Linguistics, 2002,
pp. 399–404. doi: 10.3115/1289189.1289238.
[9] D. Mollá, M. van Zaanen, and D. Smith, “Named Entity Recognition for
Question Answering,” in Proceedings of the Australasian Language
Technology Workshop 2006, Sydney, Australia, Nov. 2006, pp. 51–58.
Accessed: Apr. 02, 2023. [Online]. Available: https://aclanthology.org/U06-
1009
[10] M. Wang, “A Survey of Answer Extraction Techniques in Factoid Question Answering,” Comput. Linguist., vol. 1, no. 1.
[11] M. M. Soubbotin, “Patterns of Potential Answer Expressions as Clues to the
Right Answers”.
[12] D. Ravichandran and E. Hovy, “Learning surface text patterns for a Question
Answering System,” in Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics, Philadelphia, Pennsylvania, USA:
Association for Computational Linguistics, Jul. 2002, pp. 41–47. doi:
10.3115/1073083.1073092.
[13] R. Sun, J. Jiang, Y. F. Tan, H. Cui, T.-S. Chua, and M.-Y. Kan, “Using
Syntactic and Semantic Relation Analysis in Question Answering”.
[14] D. Shen, G.-J. M. Kruijff, and D. Klakow, “Exploring Syntactic Relation
Patterns for Question Answering,” in Second International Joint Conference
on Natural Language Processing: Full Papers, 2005. doi:
10.1007/11562214_45.
[15] T. Kočiský et al., “The NarrativeQA Reading Comprehension Challenge.”
arXiv, Dec. 19, 2017. Accessed: Apr. 08, 2023. [Online]. Available:
http://arxiv.org/abs/1712.07040
[16] T. Kwiatkowski et al., “Natural Questions: A Benchmark for Question
Answering Research,” Trans. Assoc. Comput. Linguist., vol. 7, pp. 452–466,
2019, doi: 10.1162/tacl_a_00276.
[17] W. He et al., “DuReader: a Chinese Machine Reading Comprehension
Dataset from Real-world Applications,” in Proceedings of the Workshop on
Machine Reading for Question Answering, Melbourne, Australia: Association
for Computational Linguistics, Jul. 2018, pp. 37–46. doi: 10.18653/v1/W18-
2605.
[18] H. Sak, A. W. Senior, and F. Beaufays, “Long short-term memory recurrent
neural network architectures for large scale acoustic modeling,” 2014.
[19] M. E. Peters et al., “Deep contextualized word representations.” arXiv, Mar.
22, 2018. Accessed: Apr. 09, 2023. [Online]. Available:
http://arxiv.org/abs/1802.05365
[20] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever,
“Language Models are Unsupervised Multitask Learners”.
[21] K. M. Hermann et al., “Teaching Machines to Read and Comprehend.” arXiv,
Nov. 19, 2015. Accessed: Apr. 09, 2023. [Online]. Available:
http://arxiv.org/abs/1506.03340
[22] P. Bajaj et al., “MS MARCO: A Human Generated MAchine Reading
COmprehension Dataset.” arXiv, Oct. 31, 2018. doi:
10.48550/arXiv.1611.09268.
[23] G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy, “RACE: Large-scale ReAding
Comprehension Dataset From Examinations.” arXiv, Dec. 05, 2017. doi:
10.48550/arXiv.1704.04683.
[24] P. Rajpurkar, R. Jia, and P. Liang, “Know What You Don’t Know:
Unanswerable Questions for SQuAD.” arXiv, Jun. 11, 2018. doi:
10.48550/arXiv.1806.03822.
[25] M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, “Bidirectional Attention
Flow for Machine Comprehension.” arXiv, Jun. 21, 2018. Accessed: Apr. 09,
2023. [Online]. Available: http://arxiv.org/abs/1611.01603
[26] A. W. Yu et al., “QANet: Combining Local Convolution with Global Self-
Attention for Reading Comprehension.” arXiv, Apr. 23, 2018. doi:
10.48550/arXiv.1804.09541.
[27] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of
Deep Bidirectional Transformers for Language Understanding.” arXiv, May
24, 2019. doi: 10.48550/arXiv.1810.04805.
[28] A. Conneau et al., “Unsupervised Cross-lingual Representation Learning at
Scale,” in Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, Online: Association for Computational
Linguistics, Jul. 2020, pp. 8440–8451. doi: 10.18653/v1/2020.acl-main.747.
[29] C. Raffel et al., “Exploring the Limits of Transfer Learning with a Unified
Text-to-Text Transformer.” arXiv, Jul. 28, 2020. doi:
10.48550/arXiv.1910.10683.
[30] K. Van Nguyen, D.-V. Nguyen, A. G.-T. Nguyen, and N. L.-T. Nguyen, “A
Vietnamese Dataset for Evaluating Machine Reading Comprehension.”
arXiv, Nov. 07, 2020. doi: 10.48550/arXiv.2009.14725.
[31] K. Van Nguyen, T. Van Huynh, D.-V. Nguyen, A. G.-T. Nguyen, and N. L.-
T. Nguyen, “New Vietnamese Corpus for Machine Reading Comprehension
of Health News Articles.” arXiv, Feb. 11, 2021. doi:
10.48550/arXiv.2006.11138.
[32] M. Caballero, “A Brief Survey of Question Answering Systems,” Int. J. Artif.
Intell. Appl. IJAIA, vol. 12, no. 5, 2021.
[33] J. Berant, A. Chou, R. Frostig, and P. Liang, “Semantic Parsing on Freebase
from Question-Answer Pairs,” in Proceedings of the 2013 Conference on
Empirical Methods in Natural Language Processing, Seattle, Washington,
USA: Association for Computational Linguistics, Oct. 2013, pp. 1533–1544.
Accessed: Apr. 09, 2023. [Online]. Available: https://aclanthology.org/D13-
1160
[34] A. Talmor and J. Berant, “The Web as a Knowledge-Base for Answering
Complex Questions,” in Proceedings of the 2018 Conference of the North
American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana:
Association for Computational Linguistics, Jun. 2018, pp. 641–651. doi:
10.18653/v1/N18-1059.
[35] A. Frank et al., “Question answering from structured knowledge sources,” J.
Appl. Log., vol. 5, no. 1, pp. 20–48, Mar. 2007, doi:
10.1016/j.jal.2005.12.006.
[36] F.-L. Li, W. Chen, Q. Huang, and Y. Guo, “AliMe KBQA: Question
Answering over Structured Knowledge for E-commerce Customer Service.”
arXiv, Dec. 11, 2019. Accessed: Apr. 09, 2023. [Online]. Available:
http://arxiv.org/abs/1912.05728
[37] T. T. Phan, T. C. Nguyen, and T. N. T. Huynh, “Question Semantic Analysis
in Vietnamese QA System,” in Advances in Intelligent Information and
Database Systems, N. T. Nguyen, R. Katarzyniak, and S.-M. Chen, Eds., in
Studies in Computational Intelligence. Berlin, Heidelberg: Springer, 2010,
pp. 29–40. doi: 10.1007/978-3-642-12090-9_3.
[38] D. Q. Nguyen, D. Q. Nguyen, and S. B. Pham, “Ripple Down Rules for
Question Answering,” Semantic Web, vol. 8, no. 4, pp. 511–532, Jan. 2017,
doi: 10.3233/SW-150204.
[39] X. Yao and B. Van Durme, “Information Extraction over Structured Data:
Question Answering with Freebase,” in Proceedings of the 52nd Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), Baltimore, Maryland: Association for Computational Linguistics,
Jun. 2014, pp. 956–966. doi: 10.3115/v1/P14-1090.
[40] J. Berant and P. Liang, “Semantic Parsing via Paraphrasing,” in Proceedings
of the 52nd Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), Baltimore, Maryland: Association for
Computational Linguistics, Jun. 2014, pp. 1415–1425. doi: 10.3115/v1/P14-
1133.
[41] D. Bhardwaj et al., “Question answering system for frequently asked
questions,” in Proceedings of the final workshop, 2016, p. 129.
[42] D. Wang and E. Nyberg, “CMU OAQA at TREC 2017 LiveQA: A Neural
Dual Entailment Approach for Question Paraphrase Identification”.
[43] T. M. Thai, N. H.-T. Chu, A. T. Vo, and S. T. Luu, “UIT-ViCoV19QA: A
Dataset for COVID-19 Community-based Question Answering on
Vietnamese Language.” arXiv, Sep. 14, 2022. Accessed: Apr. 09, 2023.
[Online]. Available: http://arxiv.org/abs/2209.06668
[44] Y.-S. Chuang, C.-L. Liu, H.-Y. Lee, and L. Lee, “SpeechBERT: An Audio-
and-text Jointly Learned Language Model for End-to-end Spoken Question
Answering.” arXiv, Aug. 11, 2020. Accessed: Apr. 09, 2023. [Online].
Available: http://arxiv.org/abs/1910.11559
[45] V. H. Tiệp, “Machine Learning cơ bản,” Nhà Xuất Bản Khoa Học và Kỹ Thuật, 2018.
[46] L. Breiman, “Random forests,” Mach. Learn., vol. 45, pp. 5–32, 2001.
[47] J. R. Quinlan, “Induction of decision trees,” Mach. Learn., vol. 1, pp. 81–106,
1986.
[48] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20,
pp. 273–297, 1995.
[49] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[50] Q. Chen, Z. Zhuo, and W. Wang, “Bert for joint intent classification and slot
filling,” ArXiv Prepr. ArXiv190210909, 2019.
[51] A. Aizawa, “An information-theoretic perspective of tf–idf measures,” Inf.
Process. Manag., vol. 39, no. 1, pp. 45–65, 2003.
[52] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word
representations in vector space,” ArXiv Prepr. ArXiv13013781, 2013.
[53] “Frequently Asked Questions - Microsoft Download Center.”
https://www.microsoft.com/en-us/download/faq.aspx (accessed Apr. 17,
2023).
[54] D. Bhardwaj, P. Pakray, J. Bentham, S. Saha, and A. Gelbukh, “Question
Answering System for Frequently Asked Questions,” in EVALITA.
Evaluation of NLP and Speech Tools for Italian, P. Basile, F. Cutugno, M.
Nissim, V. Patti, and R. Sprugnoli, Eds., Accademia University Press, 2016,
pp. 129–133. doi: 10.4000/books.aaccademia.1975.
[55] D. Q. Nguyen and A. T. Nguyen, “PhoBERT: Pre-trained language models
for Vietnamese,” ArXiv Prepr. ArXiv200300744, 2020.